The dynamics of individuals is of essential importance for understanding the evolution of social 
systems. Most existing models assume that individuals in diverse systems, ranging from social 
networks to e-commerce, all tend toward what is already popular. We develop an analytical time-aware 
framework which shows that when individuals make choices—which item to buy, for example— 
in online social systems, a small fraction of them is consistently successful in discovering popular 
items long before they actually become popular. We argue that these users, whom we refer to as 
discoverers, are fundamentally different from the previously known opinion leaders, influentials, and 
innovators. We use the proposed framework to demonstrate that discoverers are present in a wide 
range of systems. Once identified, they can be used to predict the future success of items. We propose 
a network model which reproduces the discovery patterns observed in the real data. Furthermore, 
data produced by the model pose a fundamental challenge to classical ranking algorithms which 
neglect the time of link creation and thus fail to discriminate between discoverers and ordinary 
users in the data. Our results open the door to qualitative and quantitative study of fine temporal 
patterns in social systems and have far-reaching implications for network modeling and algorithm 
design. 


I. INTRODUCTION 

The digital age provides us with unprecedented amounts of information about our society. The collected data are increasingly available at fine temporal resolution, which permits progress from rudimentary mechanisms in complex systems, such as preferential attachment [3], to refined versions where the fitness of individual nodes and aging play a fundamental role HE]. In a similar way, the Poisson assumption for the distribution of human activity in time [6] has been replaced with models based on a priority queuing process [7] and a cascading non-homogeneous Poisson process [8] that correctly reproduce the observed bursts of activity [S[IQ|. Bigger and better data continue to foster our understanding and modeling of human behavior.

We focus here on data produced by various online systems where users acquire items: buy products, borrow DVDs, or watch videos, for example. This kind of data is at the center of attention of the recommender systems community, which aims at predicting items that an individual user might appreciate. The user-item data can be represented and modeled by a growing network where users are connected with the collected items [m [16]. Preferential attachment assumes that the rate at which items attract new connections from users is proportional to the number of connections that items already have [2]. Models based on preferential attachment have been applied in a wide range of systems HZ]. However, all models to date consider a homogeneous population of users driven by item popularity, which in more elaborate models is modulated by item fitness, aging, or both HUHllIS].

We develop here a statistical framework based on data with time information and a new metric, user surprisal, to show that users in social systems are essentially heterogeneous in their collection patterns: while the majority of users are subject to preferential attachment and usually collect popular items, some users frequently attach to items of low popularity that nevertheless eventually become hugely popular. We focus on the latter group of users and suggest a criterion to select those of them who are statistically significant; we refer to them as discoverers here. We use our framework to find discoverers in data from a number of real systems and illustrate that the identified discoverers can be used to predict future popular items. The success of discoverers cannot be due to the fact that they act as opinion leaders or influentials [20–24] because the possibility for users to directly influence each other is absent in most of the datasets studied here. In contrast to innovators, who act in the first stage of innovation diffusion [25], discoverers are distinguished by the consistency with which they achieve discoveries and by not relying on their social status or social contacts. In other words, the discoverers discussed here are a genuinely new component in the much-studied diffusion of innovations [25–27] and the evolution of complex systems iniiiH].

To provide a possible explanation for the observed collection patterns of discoverers, we generalize a recent network growth model HES] by assuming that there are two kinds of users: those who are driven by item popularity and those who are driven by item fitness. Since high-fitness items often become very popular (though, as in real systems EOIEI], the correlation is not perfect) and fitness-driven users are often among the first users who collect these items, the model produces discovery patterns similar to those observed in the real data. We provide basic analytical results for the model and study its dependence on model parameters.

Finally, we demonstrate that the artificial data generated by this model contradict the score-feedback mechanism used by many ranking algorithms on networks [32], of which PageRank [33, 34] and HITS (Hyperlink-Induced Topic Search) [35] are the prime examples. We show that these algorithms, which act only on a static snapshot of the system, consequently fail to identify the fitness-sensitive users in model data. Our surprisal metric, although not devised to rank users, overcomes the problem by taking the chronology of link creation into account. Our observations point out for the first time that classical ranking algorithms on networks may be inadequate for a broad class of systems. The only way to overcome this inadequacy is to develop new algorithms motivated by and benefiting from temporal patterns in real data.

The paper is organized as follows. In Section II we introduce the statistical framework for the identification of discoverers. In Section III we apply the framework to datasets from two real online systems and show that they both feature strong support for discoverers. In Section IV we present a network growth model which reproduces the discovery patterns observed in the real data. We also show there that while the classical ranking algorithms fail to discriminate between the users in model data, our new framework performs well in this respect.

II. STATISTICAL PROCEDURE 

We assume that the input data have a bipartite structure where there are $U$ users, $I$ items, and $L$ links that always connect a user and an item. We label the users with Latin letters ($i, j, \ldots$) and the items with Greek letters ($\alpha, \beta, \ldots$) to make the notation more transparent.

A. Discoveries and user surprisal 

To find the users who act as discoverers of highly popular content, we devise a simple yet effective procedure. We choose a small fraction $f_D$ of the most popular items and track the users who are among the first $N_D$ users connecting with them; here $N_D$ is a small parameter. We label these early links as discoveries of the eventually popular content. The number of links created by user $i$ and the number of thus-achieved discoveries are denoted by $k_i$ and $d_i$, respectively.
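The counting of discoveries described above can be sketched in Python. This is only a minimal sketch under stated assumptions: the function name `count_discoveries`, the `(user, item, time)` tuple layout, the rule that at least one item is always treated as popular, and the tie-breaking among equally popular items are illustrative choices, not details given in the text.

```python
from collections import defaultdict

def count_discoveries(links, f_d=0.01, n_d=5):
    """Return (k, d): links and discoveries per user.

    links: iterable of (user, item, time) tuples (hypothetical layout).
    f_d:   fraction of most popular items that can yield discoveries.
    n_d:   number of earliest links per popular item counted as discoveries.
    """
    links = sorted(links, key=lambda link: link[2])  # chronological order

    # Final popularity (degree) of every item.
    item_degree = defaultdict(int)
    for _, item, _ in links:
        item_degree[item] += 1

    # Assumption: keep at least one item, ties broken by first appearance.
    n_top = max(1, int(f_d * len(item_degree)))
    ranked = sorted(item_degree, key=item_degree.get, reverse=True)
    popular = set(ranked[:n_top])

    k = defaultdict(int)     # number of links per user
    d = defaultdict(int)     # number of discoveries per user
    seen = defaultdict(int)  # links received so far by each item
    for user, item, _ in links:
        k[user] += 1
        if item in popular and seen[item] < n_d:
            d[user] += 1
        seen[item] += 1
    return dict(k), {user: d[user] for user in k}
```

With the returned dictionaries, the overall discovery probability under the null hypothesis is `sum(d.values()) / sum(k.values())`.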

To evaluate whether a user under- or outperforms in making discoveries, we formulate the null hypothesis $H_0$ that all users are equally likely to make a discovery with each collected item. Denoting the total number of discoveries and links as $D = \sum_i d_i$ and $L = \sum_i k_i$, respectively, the probability of discovery for each individual link under $H_0$ is $p_D(H_0) = D/L$. Under the null hypothesis, discoveries are independent and equally likely, so their number for any given user follows the binomial distribution. This allows us to compute the probability that user $i$ makes at least $d_i$ discoveries as

$$P^0(d_i \mid k_i, p_D, H_0) = \sum_{n=d_i}^{k_i} \binom{k_i}{n}\, p_D^n\, (1 - p_D)^{k_i - n}. \tag{1}$$


By summing over $d_i$ discoveries or more, we make sure that the probability can become very small only if the user makes too many discoveries, not too few, in comparison with the user's degree $k_i$. Note that the expected number of discoveries of user $i$ is $\langle d_i \rangle = p_D k_i$ and the total expected number of discoveries is therefore

$$\sum_i \langle d_i \rangle = \sum_i p_D k_i = D. \tag{2}$$

The binomial distribution for the number of discoveries by individual users and the real number of discoveries are thus compatible with each other. Note that the null hypothesis effectively decouples the users, whose discoveries are assumed to be independent of the discoveries made by the others. While this is not strictly true on a link-by-link basis (a user sometimes creates a link at a moment when no discoveries are possible), it still holds for each user overall because every user makes several links and, moreover, users are free to choose the time when the links are made.

To quantify the extent to which the behavior of user $i$ is incompatible with the null hypothesis, we introduce user surprisal (also referred to as self-information [36])

$$S_i := -\ln P^0(d_i \mid k_i, p_D, H_0). \tag{3}$$

The higher the surprisal, the more unlikely the user's behavior is under $H_0$. The lowest possible surprisal value $S_i = 0$ and the highest possible surprisal value $S_i = -k_i \ln p_D$ are achieved when $d_i = 0$ and $d_i = k_i$, respectively. Although the proposed procedure to compute user surprisal is not well adapted to the extreme case of a user who collects all items, this case is an instructive example. A user who is the first to collect every item (for example by setting up an automaton that periodically checks the system and collects any new items that appear) naturally makes many discoveries: each link to an item which eventually ends up among the $f_D I$ most popular items counts as one discovery. At the same time, the number of discoveries expected under the null hypothesis is $p_D I$. When $p_D > f_D$, the actual number of discoveries does not exceed the expectation under the null hypothesis and the user's surprisal value is thus small. The conclusion is that as long as $p_D$ is greater than $f_D$, even being the first to collect each item is no guarantee of achieving high surprisal.
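The surprisal of a single user can be evaluated numerically as sketched below. The function names are hypothetical, and working with logarithms (log-sum-exp) is an implementation choice made here so that very small tail probabilities do not underflow; the text does not prescribe a numerical method. The sketch assumes $0 < p_D < 1$.

```python
import math

def log_binom_pmf(n, k, p):
    """Natural log of the binomial probability of n successes in k trials."""
    return (math.lgamma(k + 1) - math.lgamma(n + 1) - math.lgamma(k - n + 1)
            + n * math.log(p) + (k - n) * math.log1p(-p))

def surprisal(d_i, k_i, p_d):
    """S_i = -ln P(at least d_i discoveries in k_i links) under the null hypothesis."""
    if d_i == 0:
        return 0.0  # the tail covers all outcomes, so its probability is one
    logs = [log_binom_pmf(n, k_i, p_d) for n in range(d_i, k_i + 1)]
    top = max(logs)
    # log-sum-exp: factor out the largest term before exponentiating
    return -(top + math.log(sum(math.exp(x - top) for x in logs)))
```

For instance, `surprisal(k, k, p)` reduces to `-k * math.log(p)`, the upper bound quoted above, and the Amazon user discussed in Section III would be evaluated as `surprisal(59, 283, 0.005)`.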


B. The bootstrap analysis 

To evaluate whether a user's discovery behavior is compatible with the null hypothesis, we use parametric bootstrap m-. Using the discovery probability $p_D$, we generate the number of discoveries under $H_0$ for each user from the binomial distribution, compute the corresponding bootstrap surprisal value, and then find the largest bootstrap surprisal value among all users. By repeating this procedure many times (we use 10,000 independent bootstrap realizations), we find the average largest surprisal value in bootstrap, $\langle S_{\max} \rangle$. A user whose real surprisal is higher than this value is referred to as a discoverer; the number of discoverers is labeled $U_D$. In addition to this quantity, bootstrap is used to obtain Zipf plots of bootstrap surprisal values, which are compared with Zipf plots of real surprisal values (given a set of values, a Zipf plot is a plot of the logarithm of a value's rank versus the logarithm of the value itself [38]).
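The bootstrap procedure can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function names, the default of 1,000 realizations (the text uses 10,000), the fixed random seed, and the Bernoulli-sum way of drawing binomial numbers are all assumptions made to keep the example short and dependent on the standard library only.

```python
import math
import random

def surprisal(d, k, p):
    """-ln of the probability of at least d successes in k trials with success rate p."""
    if d == 0:
        return 0.0
    logs = [math.lgamma(k + 1) - math.lgamma(n + 1) - math.lgamma(k - n + 1)
            + n * math.log(p) + (k - n) * math.log1p(-p)
            for n in range(d, k + 1)]
    top = max(logs)
    return -(top + math.log(sum(math.exp(x - top) for x in logs)))

def average_largest_surprisal(degrees, p_d, realizations=1000, seed=0):
    """Average over bootstrap realizations of the largest surprisal under H0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(realizations):
        largest = 0.0
        for k in degrees:
            # binomial draw: k independent links, each a discovery with probability p_d
            d = sum(rng.random() < p_d for _ in range(k))
            largest = max(largest, surprisal(d, k, p_d))
        total += largest
    return total / realizations

def discoverers(users, p_d, threshold):
    """users: dict user -> (d_i, k_i); returns users whose surprisal exceeds threshold."""
    return [u for u, (d, k) in users.items() if surprisal(d, k, p_d) > threshold]
```

A typical call would first compute `threshold = average_largest_surprisal(degrees, p_d)` from the observed user degrees and then pass the observed `(d_i, k_i)` pairs to `discoverers`.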

In each bootstrap realization, there is some non-zero probability that some users reach higher surprisal than $\langle S_{\max} \rangle$. To find this “null level” of discoverers, one can apply the same procedure to bootstrap surprisal values: all users whose bootstrap surprisal is higher than $\langle S_{\max} \rangle$ are classified as discoverers and their number is denoted by $U_D^0$ (the superscript 0 stands for the null level). Simulations show that $U_D^0$ reaches small values (of the order of one) for the two investigated datasets. We can conclude that this null level of discoverers is of little practical significance and one can omit it in the surprisal analysis and the computation of $U_D$.


III. REAL DATA ANALYSIS 

To test the proposed procedure, we use data on DVD purchases at Amazon.com and personal bookmark collections at Delicious.com. See Supplementary Information (SI) for results on four additional datasets.

A. Data description 

Amazon DVD review data were obtained from snap.stanford.edu/data/web-Movies.html [39]. After data cleaning (merging distinct items which actually correspond to the same product, typically different releases of a DVD, and removing duplicate reviews), there are 1,901,110 reviews on the integer scale 1-5 from 889,066 users for 141,039 items. While the data span 5,546 days (August 1997-October 2012), we only use the data from days 2,000 to 5,000 because the rest of the data shows comparatively low user activity. To obtain an unweighted bipartite network, we neglect all reviews with rating 3 or less and represent all reviews with rating 4 or 5 as links between the corresponding user and item. After this operation, there are 713,581 links, and 406,275 users and 76,205 items have at least one link.

Delicious.com is a web site that allows users to store, share, and discover web bookmarks. Delicious bookmark collections were obtained by downloading publicly available data from the social bookmarking website delicious.com in May 2008. Due to processing speed constraints, we randomly sampled 50% of all users available in the source data and included all their bookmarks. To avoid the possible ambiguity of various web addresses pointing to the same web page, to reduce the number of items, and thus to increase the data density, bookmarks are represented only by their base www-address without the initial protocol specification, a possible leading “www.”, and the trailing slash (e.g., http://www.edition.cnn.com/us/ is modified to edition.cnn.com); each www-address is then represented as an item-node and connected with the users who have collected it. Time stamps are counted in hours from 01/09/2003 and run from 0 to 36,027. For the same user activity reasons as in Amazon, we only use the data from hours 15,000 to 35,000. There are 107,810 users, 2,435,912 items and 9,322,949 links in the resulting data. We have also analyzed data where the full address hierarchy is preserved (e.g., edition.cnn.com/us instead of the previously mentioned edition.cnn.com) and found the same behavior as presented here.
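The reduction of a bookmark to its base address can be illustrated with a short Python sketch. The function name `base_address` is hypothetical, and lower-casing the host name is an extra assumption not stated in the text; ports or user information embedded in an address are not handled.

```python
from urllib.parse import urlsplit

def base_address(url):
    """Reduce a bookmark URL to its host name: no protocol, no leading 'www.', no path."""
    if "://" not in url:
        url = "//" + url  # lets urlsplit treat the text as a network location
    host = urlsplit(url).netloc.lower()  # assumption: host names compared case-insensitively
    if host.startswith("www."):
        host = host[len("www."):]
    return host
```

Applied to the example from the text, `base_address("http://www.edition.cnn.com/us/")` yields `edition.cnn.com`.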


B. Results on real data 

Figure 1 shows the discovery patterns and user surprisal in the real datasets. Panels A and B compare the linking patterns of two Amazon users of different surprisal. The “ordinary user” either collects popular items late or collects unpopular items and thus achieves no discoveries. By contrast, the “user with many discoveries”, though only active later during the dataset's timespan, is frequently among the first to collect eventually popular items and achieves 59 discoveries in 283 links, whereas the overall discovery probability is $p_D \approx 0.5\%$, which for the given number of links corresponds to 1.4 discoveries on average. Panels C and D further show the degree and surprisal values in the analyzed data. While the maximal possible surprisal value of an individual user grows linearly with user degree (depicted with dashed lines), user activity alone is no guarantee of high surprisal and top surprisal values are achieved by some moderately active users (see Tab. S2 for the list of users with the highest surprisal values in the two datasets). Finally, one can see here that when the number of discoveries is fixed, the surprisal value decreases with user degree.

Results of the bootstrap analysis in Figure 2 show that the largest surprisal values in bootstrap realizations sampled under $H_0$ are never as high as the largest surprisal in real data. For $f_D = 1\%$ and $N_D = 5$, there are 49 and 525 identified discoverers in the Amazon and Delicious data, respectively (0.01% and 0.49% of all users, respectively). The highest surprisal values correspond to the probabilities 10“^^^ and 10“^^ for the Amazon and Delicious data, respectively. The same kind of discovery behavior in four additional data sets is reported in Fig. S2. The SI further demonstrates that there is no particular time bias in the discovery patterns (e.g., discoverers are not those who happen to be active earlier or longer than the others) and that the discoveries are made continuously during the system's lifetime (Figures S3 and S4, respectively). While numerical values of surprisal depend on the parameters $f_D$ and $N_D$, the resulting ranking of users by their surprisal is rather stable (see Fig. S5). Finally, Figure S6 demonstrates that the ranking of users by their surprisal does not change considerably when only part of the data is taken into account. We can conclude that the null hypothesis of user homogeneity needs to be rejected because some users are indeed significantly more successful than the others in collecting eventually popular items early. This phenomenon is not restricted to particular conditions and emerges consistently in systems where individuals are free to choose among many heterogeneous items.

FIG. 1. Discoveries and user surprisal in real data. (A, B) A comparison of the linking patterns of an ordinary user and a “discoverer” in the Amazon data (see Fig. S1 for results on the Delicious data). Each bar corresponds to a collected item; the red and blue parts show item popularity at the time when the user collected it and the final popularity, respectively. Green circles mark those collected items that are eventually identified as discoveries (see the definition in the text). (C, D) Scatter plots of user degree and user surprisal in the Amazon and Delicious data. Users are color-coded according to their number of discoveries. The dashed lines mark the maximal achievable user surprisal at a given user degree. All results are for $f_D = 1\%$ and $N_D = 5$.

FIG. 2. Zipf plots of user surprisal in real data and in bootstrap. All results are for $f_D = 1\%$ and $N_D = 5$.

[Plot for FIG. 3, Amazon panel. Curves: all users, zero surprisal, surprisal < 10, surprisal > 10. Horizontal axis: days in the future.]

FIG. 3. Future degree evolution of the target items collected by users of different surprisal in Amazon (A) and Delicious (B) data. The items collected by high surprisal users become significantly more popular (according to the Mann-Whitney test) than target items collected by users of low or zero surprisal. The popularity ratio between the high and zero surprisal groups at the end of the future time window is 10.0 and 6.1 for the Amazon and Delicious data, respectively.

We next investigate whether the presence of users who make discoveries more often than the others is of practical significance. To this end, we generate multiple data subsets and in each of them define young items with exactly one link as the target items whose future popularity is to be predicted (see SI for details). Since the information on these items is extremely limited and the social network of users is either absent in the studied systems or not known to us, traditional methods for predicting the popularity of online content cannot be used here [40–42]. We divide users in each subset into three groups: zero, low, and high surprisal users (the threshold between low and high surprisal is set to 10, which is close to the average highest surprisal value in bootstrap in both data sets). The data that come after a given subset are then used to evaluate the future degree evolution for the target items collected by users from different groups. Figure 3 demonstrates that the target items chosen by users of high surprisal become significantly more popular than those chosen by users of zero or low surprisal. This shows that surprisal not only quantifies users' past behavior but also has predictive power.


IV. NETWORK MODEL 

The question now is how to explain the observed collection patterns of discoverers. A possible explanation is that discoverers are more influential than other users, which in turn leads to the items collected by them eventually becoming popular. However, most of the systems that we analyze here lack any explicit mechanism for users to exert influence over the others, especially on such short time scales as we speak of here (we use $N_D = 5$ throughout the paper, which means that only the first five users are awarded a discovery for collecting a relevant item). We also have data from Yelp.com, a web site for crowd-sourced reviews of local businesses, which is unusual in comprising both bipartite user-item data and an explicit social network of users. However, we find no correlation between the number of friends and user surprisal, which indicates that even when explicit influence can be exerted, it is not sufficient to explain the behavior of discoverers (see SI, Section S3, for details). This agrees with the finding that easily influenced individuals contribute to the rise of exceptionally popular items more than so-called influentials [22]. In the Amazon data, we also have information on the number of users who find a review useful, which allows us to study the possible correlation between the average level of usefulness of a user's reviews and the user's surprisal value. However, we find no significant correlation, which suggests that well-written and informative reviews do not contribute to the success of discoverers.

Motivated by these observations as well as by the presence of discoverers across many different systems, we propose an intrinsic mechanism to explain the observed discovery patterns. We first assume that some items are inherently more fit for a given system than the others and thus have a higher chance of becoming very popular in the long run. Network models with node fitness have been studied in the past nnumiii] and they have been used to model various systems such as the World Wide Web [45], citations of scientific papers H ES], and an online scientific forum [29], for example. Unlike the existing models, we then assume that the users differ in how they perceive item fitness and choose the items for their collections. While the first group of users are driven by item popularity and thus mostly ignore new and unpopular items, the second group of users are driven by item fitness. Discoverers then emerge among the users in the latter group because: (1) fitness-driven users are consistently among the first ones to collect items of high fitness, (2) high-fitness items often become very popular, (3) active fitness-sensitive users have the potential to achieve many discoveries and eventually be identified as discoverers by the statistical procedure that we propose here.

Two groups of consumers, innovators and imitators, are also considered by the Bass model [21], which constitutes a seminal model for the diffusion of innovations. However, the Bass model does not consider competition among the items and the link between an item's final popularity and its properties. Because of its focus on individual items, this model also does not consider the question whether individual users repeatedly act as innovators or imitators, respectively. Finally, the Bass model predicts a temporal exponential decay of the number of links received by a node, which disagrees with the linking patterns of real systems where the temporal decay is slower (see [5] for a quantitative study of the Bass model in scientific citation data and a comparison with the network growth model proposed in HD-). We thus do not attempt to use the Bass model for modeling the discovery patterns found in real data.

We generate artificial bipartite networks with $U$ users where the number of items gradually grows from a small number $I_0$ to $I$ (we use $U = 4000$, $I_0 = 50$, and $I = 8000$ here). There are $U_F$ fitness-sensitive users and the remaining $U - U_F$ users are popularity-driven. Each user is further endowed with a level of activity which determines how likely the user is to collect a new item in any time step. While one can vary the distribution of activity among the users to model a broad range of real systems, user activity values are for simplicity drawn from the uniform distribution on $[0, 1]$ here. Item fitness quantifies how suitable and attractive an item is to the given system and its users; fitness values $f_\alpha$ are drawn from a power-law distribution with the lower bound $f_{\min} = 1$ and exponent 3. As the subsequent analytical computation shows, a power-law fitness distribution directly translates into a power-law distribution of item popularity. Our choice of the item fitness distribution thus allows us to mimic real systems where the distribution of node popularity (degree) is often broad, typically power-law or log-normal m-. The time at which item $\alpha$ has been added to the system is denoted as $\tau_\alpha$. New links are added regularly until the final network density $\eta$ is achieved; the total number of links is thus $L = \eta U I$. To reach $I$ items before all links have been added to the network, new items are added every $L/(I - I_0 + 1)$ steps.

In the simulation, one user-item link is added in every time step. The user who creates this link is chosen from the pool of users with probability proportional to user activity. If a fitness-driven user $i$ creates a link at time $t$, the probability of choosing item $\alpha$ satisfies

$$P_{i\alpha} \propto f_\alpha\, A(t - \tau_\alpha) \tag{4}$$

where $A(t - \tau_\alpha) = \exp[-(t - \tau_\alpha)/\Theta]$ is an aging factor (see HUS] for the original model of network growth with heterogeneous fitness and aging). Consequently, $\Theta$ is the typical time scale on which item attractiveness decays; we use $\Theta = 1000$, which is neither too short (in which case the high-fitness items do not have sufficient time to attract many links and the resulting degree distribution is thus very homogeneous) nor too long (in which case a strong bias towards old items develops and the fitness-popularity correlation is low). If a popularity-driven user $i$ creates a link at time $t$, the probability of choosing item $\alpha$ satisfies

$$P_{i\alpha} \propto \bigl(k_\alpha(t) + 1\bigr)\, A(t - \tau_\alpha) \tag{5}$$

where $k_\alpha(t)$ is the degree (popularity) of item $\alpha$ at time $t$. The additive term in $k_\alpha + 1$ is necessary to allow items of zero degree (every item is introduced into the system with zero degree) to gain their first links. Multiple links between a given user and an item are not allowed.
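The growth rules above can be turned into a small simulation. The sketch below is only an illustration with deliberately small default sizes, not the authors' code: the function name `simulate`, the default parameter values (including the density `eta`, which is not specified in this section), labeling the first `U_F` users as fitness-driven, and redrawing the user when nothing is left to collect are all assumptions.

```python
import math
import random

def simulate(U=40, I=80, I0=5, U_F=10, eta=0.05, theta=100.0, seed=0):
    """Grow a bipartite user-item network; returns (links, fitness, is_fitness_driven)."""
    rng = random.Random(seed)
    activity = [rng.random() for _ in range(U)]        # uniform activity levels
    is_fitness_driven = [u < U_F for u in range(U)]    # assumption: first U_F users
    L = int(eta * U * I)                               # total number of links
    item_interval = L / (I - I0 + 1)                   # steps between item arrivals
    fitness, birth, degree = [], [], []
    collected = [set() for _ in range(U)]
    links = []

    def add_item(t):
        # inverse-transform sampling of a density ~ f^(-3) on f >= 1
        fitness.append((1.0 - rng.random()) ** -0.5)
        birth.append(t)
        degree.append(0)

    for _ in range(I0):
        add_item(0)
    next_arrival = item_interval
    for t in range(L):
        while len(fitness) < I and t >= next_arrival:
            add_item(t)
            next_arrival += item_interval
        while True:  # assumption: redraw the user if nothing is left to collect
            u = rng.choices(range(U), weights=activity)[0]
            candidates = [a for a in range(len(fitness)) if a not in collected[u]]
            if candidates:
                break
        weights = []
        for a in candidates:
            aging = math.exp(-(t - birth[a]) / theta)
            # rule (4) for fitness-driven users, rule (5) otherwise
            drive = fitness[a] if is_fitness_driven[u] else degree[a] + 1
            weights.append(drive * aging)
        a = rng.choices(candidates, weights=weights)[0]
        collected[u].add(a)
        degree[a] += 1
        links.append((u, a, t))
    return links, fitness, is_fitness_driven
```

The returned `links` are `(user, item, time)` triples, so they can be fed directly into a discovery-counting and surprisal analysis of the kind described in Section II.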


A. Basic analytical results 


Denoting the fraction of fitness-driven users as $\mu_F := U_F/U$, we can write the following continuum equation which describes the evolution of the average degree of item $\alpha$ (see H [T7] for more details on the continuum approximation approach)

$$\frac{d\langle k_\alpha(t) \rangle}{dt} = \mu_F\, \frac{f_\alpha A(t - \tau_\alpha)}{\sum_\beta f_\beta A(t - \tau_\beta)} + (1 - \mu_F)\, \frac{\bigl(k_\alpha(t) + C\bigr) A(t - \tau_\alpha)}{\sum_\beta \bigl(k_\beta(t) + C\bigr) A(t - \tau_\beta)} \tag{6}$$

where the two terms represent the contributions of the fitness- and popularity-driven users, respectively, and $C$ is the additive constant of Eq. (5) ($C = 1$ there). The presence of the aging factor $A(\cdot)$ allows us to replace the sums in the denominators with their average values, which the sums approach on the time scale set by $A(\cdot)$ and around which they then fluctuate. In particular, we have


^^(^/3(^) + C)A{t — Tp) 

0 

Equation (6) can now be solved analytically and yields the asymptotic result


(fca(oo)) 




(8) 


where $T = \int_0^\infty A(t)\, dt$. Results for the previous model with preferential attachment, fitness and aging presented in HE] are recovered by setting $\mu_F = 0$ and replacing $T$ with $T f_\alpha$. We see that the expected final degree of items is indeed proportional to item fitness (the proportionality factor is given by the fraction of fitness-driven users in the system).
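The solution step can be illustrated by a short worked derivation. It is a sketch under stated assumptions rather than a restatement of the paper's own formulas: the symbols $\Omega_F$ and $\Omega_P$ for the stationary values of the two denominator sums are introduced here for illustration only, $C$ denotes the additive constant of the popularity-driven rule ($C = 1$ in Eq. (5)), each item is assumed to start with zero degree, and $0 < \mu_F < 1$.

```latex
% Assumed notation: \Omega_F and \Omega_P are the stationary values of the sums.
\sum_\beta f_\beta A(t-\tau_\beta) \approx \Omega_F, \qquad
\sum_\beta \bigl(k_\beta(t)+C\bigr) A(t-\tau_\beta) \approx \Omega_P .

% With s(t) = \int_0^{t-\tau_\alpha} A(t')\,dt', the continuum equation is linear:
\frac{d\langle k_\alpha\rangle}{ds}
  = \frac{\mu_F f_\alpha}{\Omega_F}
  + \frac{(1-\mu_F)\bigl(\langle k_\alpha\rangle + C\bigr)}{\Omega_P},
\qquad \langle k_\alpha\rangle\big|_{s=0} = 0 .

% Solving and letting s \to T = \int_0^\infty A(t)\,dt:
\langle k_\alpha(\infty)\rangle
  = \left(C + \frac{\mu_F\,\Omega_P}{(1-\mu_F)\,\Omega_F}\, f_\alpha\right)
    \left[\exp\!\left(\frac{(1-\mu_F)\,T}{\Omega_P}\right) - 1\right].
```

Under these assumptions the expected final degree grows linearly with $f_\alpha$, with a slope that increases with $\mu_F$; setting $\mu_F = 0$ leaves $C\,[\exp(T/\Omega_P) - 1]$, which depends on fitness only after $T$ is replaced by $T f_\alpha$.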
FIG. 4. User surprisal and discoverers in artificial model networks. (A) The relation between item fitness and popularity as well as the cumulative distribution of item popularity ($U_F = 600$). (B) Item degree distributions for various values of $U_F$. (C) The Zipf plot of user surprisal in artificial data ($U_F = 0, 600$) and in bootstrap (this curve is independent of $U_F$). (D) The dependence of the number of discoverers identified by the proposed statistical procedure on $U_F/U$. For comparison, we also show here results for 2,000 users and 4,000 items.


We finally note that one can devise a model with continuously distributed user ability in the range $[0, 1]$ where the two aforementioned item-choosing rules are merged into one. We have studied the multiplicative form

Pia ^ fa A{t - To) (9) 

which implies that users of ability one respond only to item fitness, users of ability zero respond only to item popularity, and there is a continuous spectrum of user behavior between these two boundary ability values. However, we find the binary model with two discrete user groups easier to interpret and more amenable to analytical solution.
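One multiplicative rule consistent with the two limits described above can be written as follows. The symbol $a_i$ for the ability of user $i$ and the specific exponents are assumptions made for illustration; other interpolations with the same limits are possible.

```latex
% Assumed notation: a_i \in [0,1] is the ability of user i.
P_{i\alpha} \propto f_\alpha^{\,a_i}\,
  \bigl(k_\alpha(t)+1\bigr)^{1-a_i}\, A(t-\tau_\alpha),
\qquad a_i \in [0,1].
% a_i = 1 recovers the fitness-driven rule (4);
% a_i = 0 recovers the popularity-driven rule (5).
```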


B. Results on model data 

Simulation results for the artificial model are presented in Figure 4. Figure 4A shows that when a significant number of users are sensitive to item fitness (here $U_F = 600$), the resulting networks exhibit strong correlation between item fitness and popularity. As $U_F$ decreases, this correlation gradually vanishes because we assume that the popularity-sensitive users ignore item fitness. As shown in Figure 4B, the distribution of item popularity is indeed rather broad and displays a power-law tail when $U_F$ is positive, which agrees with the approximate analytical solution above. Figure 4C demonstrates that when $U_F$ is positive, user surprisal computed in model data differs from the bootstrap surprisal profile in the same way as we have shown in Figure 2 for the real data. The number of identified discoverers as a function of the number of fitness-sensitive users is displayed in Figure 4D. The dependence is notably non-monotonic. When $U_F$ is small, the correlation between item fitness and popularity is low and many of the popular items that are used to assign discoveries are thus of low fitness; the fitness-sensitive users thus fail to achieve many discoveries and the resulting $U_D$ is close to zero. As $U_F$ grows, the fitness-popularity correlation increases and so does $U_D$, but eventually the number of fitness-sensitive users is too large for the number of available discoveries and $U_D$ declines. For intermediate values of $U_F$, the numbers of identified discoverers are significant and we can thus conclude that the proposed simple model is able to reproduce the discovery patterns observed in real data. The observed fraction $U_D/U$, which gets as high as 0.03% at $U_F = 300$, is similar to that found in the Amazon data.

Note that the groups of fitness-driven users and discoverers are in general not the same. While in the current setting, all discoverers identified using the proposed statistical framework are fitness-driven, only a small fraction of fitness-driven users are identified as discoverers (in Figure 4D, for example, $U_F = 600$, yet $U_D \approx 10$). There are various reasons why a fitness-driven user does not become a discoverer: the user is not active enough, or by chance becomes active at moments when there are no relevant items (that is, high-fitness items of low popularity) available and hence no discoveries can be made, or simply fails to connect with the available relevant items because of the probabilistic network growth mechanism. The fact that discoverers are found in the model data is thus not automatic and the number of statistically significant discoverers depends strongly on model parameters.

We close with a discussion of the profound implications of
the presence of fitness- and popularity-driven users for
node ranking algorithms in networks. The typical goal
of a network ranking algorithm is to find the most
important nodes. In the context of a user-item network,
the most important item nodes are those that have the
highest fitness and the most important user nodes are
those who are fitness-driven. To go beyond nodes’ local
neighborhoods and thus benefit from the network
structure, these algorithms usually allow scores to propagate
between nodes [23 EH]. In the context of bipartite
user-item networks, this means that the user score is given
by the score of the items that have been collected by the
user and the item score is given by the score of the
users who have collected the item, as formalized by a
bipartite version [49] of the classical HITS algorithm [35].
The present class of model data however poses an
important difficulty: Figure 5A shows that despite some users
being more sensitive to item fitness than the others, both
user groups ultimately collect items of similar fitness (in
other words, the correlation between user ability and the
average fitness of collected items is low). The problem is
particularly pronounced for the users who have collected
the best items on average: only 30% of the 50 users who
are best in this respect are actually fitness-sensitive; the
remaining 70% are popularity-driven. The reason for
this is simple: after fitness-driven users find high-fitness
items, popularity-driven users are likely to copy their
choice and end up with items which are only marginally
worse than those collected by the fitness-driven users.

This observation suggests that the broad class of
network-based algorithms may be unsuitable for
networks where preferential attachment allows ordinary
users to effectively copy the choices made by
discoverers. To verify that, we apply a bipartite HITS algorithm
to artificial networks and evaluate the fraction of the top 50
positions in the resulting user ranking that is actually
occupied by fitness-sensitive users. The algorithm
assigns scores $x_i$ and $y_\alpha$ to users and items,
respectively, that satisfy the set of equations

$$x_i = \frac{1}{k_i} \sum_{\alpha \in \mathcal{X}_i} y_\alpha, \qquad y_\alpha = \sum_{i \in \mathcal{Y}_\alpha} x_i, \tag{10}$$

where $\mathcal{X}_i$ is the set of items collected by user i and
$\mathcal{Y}_\alpha$ is the set of users who have collected item α. In other
words, the score of a user is given by the average score of
the items collected by them and the score of an item is given
by the total score of the users who collected it. The set of
equations is typically solved iteratively by first setting uniform
score vectors and then recomputing the x values based on
the current y values and vice versa [50]. To prevent the
score vectors from diverging, they need to be normalized
after each iteration. Figure 5B confirms that bipartite
HITS indeed performs only marginally better than a
random ranking of users (the fraction of fitness-driven users
in the top 50 is then the same as in the whole population, i.e.
U_F/U). We now see that this algorithm fails because it
ignores the time information: users who discover genuinely
valuable content are indistinguishable from those who
later copy their choice. Omitting, for example, the
normalization with k_i in Eq. (10) thus
does not change the results significantly. To correctly
rank users in this network, an algorithm needs to account
not only for who has collected what but also for when they
have done so. Although surprisal has not been devised
to rank users, we apply it to the artificial networks.
Figure 5B shows that the precision of the rankings obtained
by user surprisal exceeds that achieved by bipartite HITS
and reaches a near-perfect precision of 95% at U_F = 400.
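
The iteration described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used for Figure 5B: the function names, the representation of links as (user, item) pairs, the fixed number of iterations, and the choice to normalize each score vector to unit sum are all assumptions made here for concreteness. The second function computes the ranking precision as defined in the caption of Fig. 5.

```python
from collections import defaultdict


def bipartite_hits(links, n_iter=100):
    """Iterate Eq. (10) on (user, item) links; names and defaults are illustrative."""
    items_of = defaultdict(set)  # items collected by each user
    users_of = defaultdict(set)  # users who collected each item
    for user, item in links:
        items_of[user].add(item)
        users_of[item].add(user)

    # Start from uniform score vectors.
    x = {u: 1.0 for u in items_of}
    y = {a: 1.0 for a in users_of}
    for _ in range(n_iter):
        # User score: average score of the collected items (the 1/k_i factor).
        x = {u: sum(y[a] for a in s) / len(s) for u, s in items_of.items()}
        # Item score: total score of the users who collected the item.
        y = {a: sum(x[u] for u in s) for a, s in users_of.items()}
        # Normalize to unit sum (an assumed choice) so the scores stay bounded.
        x_total, y_total = sum(x.values()), sum(y.values())
        x = {u: v / x_total for u, v in x.items()}
        y = {a: v / y_total for a, v in y.items()}
    return x, y


def precision_at_top(scores, fitness_driven, top=50):
    """Fraction of fitness-driven users among the `top` highest-scored users."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:top]
    return sum(u in fitness_driven for u in ranked) / len(ranked)


# Toy usage with a hypothetical four-link network.
links = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u3", "c")]
x, y = bipartite_hits(links)
print(precision_at_top(x, fitness_driven={"u2"}, top=2))
```

Note that the sketch never looks at link creation times, which is exactly why such a ranking cannot separate users who collect an item early from those who copy the choice later.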

It has been demonstrated that in real systems, the pop¬ 
ularity of items is path-dependent and sensitive to system 
design and possible external factors miEi], which ques¬ 
tions the basic premise of the proposed user surprisal 
measure which uses the most popular items as the rel¬ 
evant items for which discoveries are awarded to users. 
The analysis of model data allows us to return to this 
important point equipped with better understanding of 
both the statistical procedure and the systems on which 
it is applied. We find discoverers in the model data
despite the fact that the correlation between item fitness
and popularity is far from perfect (see Figure 4A), and
almost all of the identified discoverers are indeed
fitness-sensitive (see Figure 5B for U_F ≈ 400). This high
robustness to a sub-optimal choice of relevant items is
due to the fact that when some popular items are
actually of low fitness, fitness-sensitive users simply ignore
them. By contrast, the popularity-sensitive users gain
some discoveries for these inferior popular items but since
these users are typically in the majority by a wide margin,
the discoveries achieved in this way are not sufficient to
reach significant values of user surprisal. We see that while
the imperfect choice of the relevant items reduces the
signal for fitness-sensitive users, it creates only a weak
false signal for popularity-sensitive users.

FIG. 5. Ranking of users in model networks. We show here results for networks with 4,000 users and 8,000 items. (A)
Histograms of the average fitness of items collected by fitness- and popularity-driven users, respectively, overlap substantially
(U_F = 600, one network realization). (B) The performance of HITS and surprisal in ranking users according to their sensitivity
to item fitness. Ranking precision is defined as the fraction of fitness-driven users in the top 50 positions of the ranking (results
are averaged over 10 model realizations, error bars represent the standard deviation).


V. DISCUSSION 

In this article, we introduce discoverers as the users 
in data from real systems who significantly outperform 
the others in the rate of making discoveries, i.e. in be¬ 
ing among the first ones to collect items that eventually 
become very popular. We develop a statistical frame¬ 
work to identify the discoverers and use it to demonstrate 
that they can be found across a number of online systems 
where users have the freedom to choose from a large
number of possible items. The proposed approach is suitable
for any data with time information. Evidence for
discovery behavior in monopartite networks (work in progress)
shows that our approach is applicable and relevant to an
even broader range of systems than those studied here.
The ability to identify the discoverers is shown to be
beneficial for predicting the future popularity of items as
well as for ranking the users. Motivated by the generality of the
observed phenomenon and a lack of direct ways for an 
individual to influence other users in the systems stud¬ 
ied here, we search for a unifying mechanism to model 
the discovery behavior. To this end, we generalize the 
preferential-attachment network growth model with fit¬ 
ness and aging [4] by assuming that not only the item 
nodes differ in their fitness but also the user nodes differ 
in their sensitivity to item fitness. In the model data, 
fitness-sensitive users recognize the high fitness items, 
collect them, and these items then often eventually be¬ 
come very popular due to their high fitness. While the 
model reproduces the discovery patterns found in the real 
data, we emphasize that the main goal of the model is to 
show that the reported discovery patterns can be mod¬ 
eled based on a small variation of the existing network 
growth models. A comprehensive study of model
parameterizations that best agree with real data as well as both
quantitative and qualitative analysis of various possible
reasons for the presence of discoverers in real data remain
as future research challenges.

Model data show low correlation between user abil¬ 
ity and the average fitness of items collected by them. 
This seemingly ordinary finding has far-reaching impli¬ 
cations because it contradicts the basic assumptions of 
many network-based ranking algorithms such as PageR- 
ank and HITS. We show that while a traditional ranking 
algorithm indeed performs poorly on the model data, the 
newly developed user surprisal metric works well due to 
the fact that it takes the system’s complete time evolu¬ 
tion into account, not just the final state. This indicates 
the need for new algorithms that act on detailed network 
representations with full information about the creation 
time of all nodes and links. A similar observation in the 
context of diffusion in temporal networks [52] indicates 
that the suggested approach is a general one: evolving 
networks require temporal methods. 

We stress again that the classical concepts of social 
leaders or innovators who have high social status or are 
well positioned in the social network, extensively studied 
in the past [2Ql EH EH Eg, do not provide a full expla¬ 
nation for the presence of discoverers who do not share 
any advantageous or privileged position and achieve dis¬ 
coveries consistently over time. Our work demonstrates 
the presence of discoverers in social systems and at the 
same time calls for a deeper understanding of their be¬ 
havior and roles. To quantify the level to which a user’s 
discovery performance is due to some external influence 
(for example, a minority of the identified high surprisal 
users in the Amazon data are members of the Amazon’s 
Vine Voice program which gives them advance access to 
not-yet-released products) is just one of the steps towards 
understanding the phenomenon of discoverers. 

Our results provoke several further questions. The null
hypothesis of a discovery probability independent of
time assumes that the real discovery rate is constant. While
Figures S3 and S4 show that this requirement is
fulfilled in the studied datasets, a general framework where
the system’s timespan is divided into multiple bins and
p_D is computed for each bin separately could render
the requirement unnecessary. We have already discussed
that the proposed statistical framework is robust to a
sub-optimal choice of the set of relevant items, which is
crucial for its relevance in any realistic setting. However, we
may still attempt to improve the initial choice on the
basis of the computed values of user surprisal. For example,
we can replace the popular items that are actually
neglected by the users of high surprisal with the items that
are better received by them. User surprisal can then be
recomputed using the updated group of relevant items;
by repeating the described steps, we eventually obtain
a closed self-consistent system where relevant items are
those that are collected by high surprisal users and high
surprisal users are those who discover the relevant
items early. Regarding the proposed network growth model,
an extensive set of dynamical and statistical
measurements is needed to determine how well the model
represents the relevant features of real systems in
comparison with other existing network models. For example,
if we want to produce realistic discovery patterns, is it
necessary to assume that the popularity-driven users are
wholly ignorant of item fitness? At a more basic level,
the knowledge of user surprisal values gives us for the
first time the possibility to discriminate between
similarly popular items purely based on their success among
users of different surprisal. This and the potential use
of discoverers for predicting the future success of items
illustrated in Figure 3 are the first hints of our work’s
potential applications in e-commerce and marketing in
general.
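
The binned variant of the null hypothesis mentioned at the start of this paragraph could begin from a per-bin estimate of the discovery probability. The sketch below is only one possible reading of that idea. It assumes that p_D in a bin is estimated as the fraction of links created in that bin that are marked as discoveries, that bins have equal width, and that each link is given as a hypothetical (time, is_discovery) pair; none of these choices is specified in the text.

```python
def binned_discovery_probability(links, n_bins, t_start, t_end):
    """Estimate the discovery probability separately in equal-width time bins.

    `links` is an iterable of (time, is_discovery) pairs (an assumed format).
    Returns one estimate per bin, or None for bins that contain no links.
    """
    width = (t_end - t_start) / n_bins
    total = [0] * n_bins
    discoveries = [0] * n_bins
    for t, is_discovery in links:
        # Links at exactly t_end are assigned to the last bin.
        b = min(int((t - t_start) / width), n_bins - 1)
        total[b] += 1
        discoveries[b] += bool(is_discovery)
    return [d / n if n else None for d, n in zip(discoveries, total)]


# Toy usage: five links over the interval [0, 10], split into two bins.
example = [(0, True), (1, False), (5, False), (9, True), (10, True)]
print(binned_discovery_probability(example, n_bins=2, t_start=0, t_end=10))
```

With such per-bin estimates, the expected number of discoveries of a user would depend on when the user was active rather than only on the user's degree.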

Supplementary Information


Appendix A: Data description 

We analyzed six different real data sets. The description and results for the first two, Amazon and Delicious, are
presented in the main text. Here follows a brief description of the other four data sets. The corresponding results are
presented in Figure S2.

1. We downloaded the Epinions.com consumer review data from konect.uni-koblenz.de/networks/. The original
data comprise 120,492 users, 755,760 items and 13,668,320 ratings. The time span of the data is from 9 January
2001 to 29 May 2002. In the raw data, the time stamps exhibit a periodic pattern with respect to link order.
In addition, many links appear on the starting day of the data. To avoid these two problems, we use only links
ranked from 12,276,827 to 13,213,749 in the original data. Since ratings are on the integer scale from 1 to 5,
we apply the same threshold mechanism as in the Amazon data. The final subset contains 17,542 users, 32,482
items and 753,392 links. The time span of the subset is from 16 January 2001 to 29 May 2002 (499 days in total).

2. Keyword data from the biggest Chinese online shopping website taobao.com were crawled via the open API of
the web site. In the Taobao e-commerce platform, vendors can use keywords to describe their products and
well-chosen keywords can contribute to their products being ranked at the top of customers’ search results.
At the same time, vendors have to pay a price for using keywords and the price of a keyword depends on the
keyword’s popularity; vendors thus have an incentive to invent new keywords or to adopt already existing
keywords early. The data comprise 2,824,853 links between 1,523 online retailers and 915,271 keywords that they
attached to their products. The time span of the data is from 12 November 2009 to 21 June 2014 (40,360 hours in
total).

3. Movielens movie rating data were obtained from grouplens.org/datasets/movielens/. The original data
comprise 10,000,054 ratings from 71,567 users to 10,681 movies in the online movie recommender service
MovieLens. Since ratings are on the integer scale from 1 to 5, we apply the same threshold mechanism as in the
Amazon data. The time span of the data is from January 1995 to January 2009 (122,634 hours in total). We use
the subset from hour 40,000 until the end of the data to avoid an initial period of low user activity. The final
subset contains 2,132,128 links between 44,548 users and 7,974 items.

4. Netflix DVD rating data were made available for the Netflix Prize contest and can still be downloaded from
www.netflixprize.com/. The original data comprise 100,481,826 ratings from 480,189 users to 17,770 movies
in the online DVD rental website Netflix. Since ratings are on the integer scale from 1 to 5, we apply the same
threshold mechanism as in the Amazon data. The time span of the data is from January 2000 to January 2006
(2242 days in total). We use the subset from day 500 to day 1,500 to constrain the data size. The final subset
contains 2,775,772 links between 115,131 users and 7,351 items.

When taking a subset, we only include objects that have not appeared before the subset’s beginning, so that we
do not mistake some later links for discoveries (if item X first appears at time 100 and collects many links until a
subset’s beginning, we do not want to award discoveries to the first links to item X within the subset because they are
actually not discoveries). Creating a subset still means that we include only a part of each user’s links. As a result, a
user could be a good discoverer in our subset but a less good one before or after. However, this is unbiased sampling
from a user’s links which does not give undue advantage to anyone.


Appendix B: Evaluation of future degree evolution 

We choose subsets of time span T_S by choosing their starting time T_X at random from the range [0, T_W − T_S − T_F),
where T_W is the time span of the whole dataset and T_F is the length of the future time window over which we
observe the future degree increase of items (see the diagram below). Each subset contains only items that have not
been collected before the subset’s starting point (if we included those items, we could assign false discoveries for
them despite the fact that those items might have already collected a substantial number of links before the subset’s
start). A given subset is then used to compute the surprisal of all its users. As items of interest, we further choose all
items that have received exactly one link and have appeared at most τ_max before the subset’s end time (these represent
young and as yet unpopular items). τ_max is 20 and 2 days for the Amazon and Delicious data, respectively, which
accounts for the different dynamics of these two systems.

We then track all links that are attached to the items of interest in the future time window of length T_F (i.e., these
links are not part of the subset which was used to compute user surprisal values). This allows us to compute the
average degree of these items as a function of time. Results are further averaged over 100 subsets defined by their
T_X value. In Figure 2 in the main text, we plot the average degree increase computed separately for items of interest
collected by three distinct user groups: users of zero surprisal, users of low (less than ten) surprisal and users of high
(ten and more) surprisal. The chosen parameters of this evaluation procedure are summarized in Table S3. While
their values influence the detailed shape and relative height of the curves reported in Figure 2, the main result of
items collected by high surprisal users being more popular than items collected by users of zero or low surprisal holds
always.
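
The subset construction described in this appendix can be sketched as follows. The code is an illustration under stated assumptions, not the procedure's reference implementation: links are assumed to be (time, user, item) tuples, all time windows are treated as half-open intervals, and the function and argument names (`random_start`, `build_subset`, `t_x`, `tau_max`, and so on) are hypothetical.

```python
import random


def random_start(t_w, t_s, t_f, rng=random):
    """Draw a subset start time from [0, t_w - t_s - t_f)."""
    return rng.random() * (t_w - t_s - t_f)


def build_subset(links, t_x, t_s, t_f, tau_max):
    """Split (time, user, item) links into a subset and its future window.

    Returns the subset links, the items of interest, and the future links
    attached to the items of interest.
    """
    t_end = t_x + t_s
    # First appearance of every item in the whole data.
    first_seen = {}
    for t, _, item in links:
        if item not in first_seen or t < first_seen[item]:
            first_seen[item] = t
    # Keep links inside the subset window, but only to items that were not
    # collected before the subset's start.
    subset = [(t, u, a) for t, u, a in links
              if t_x <= t < t_end and first_seen[a] >= t_x]
    degree = {}
    for _, _, a in subset:
        degree[a] = degree.get(a, 0) + 1
    # Items of interest: exactly one link, appeared at most tau_max before the end.
    interest = {a for a, k in degree.items()
                if k == 1 and t_end - first_seen[a] <= tau_max}
    # Links to the items of interest in the future window of length t_f.
    future = [(t, u, a) for t, u, a in links
              if t_end <= t < t_end + t_f and a in interest]
    return subset, interest, future


# Toy usage with a fixed start time instead of a random one.
toy = [(1, "u1", "old"), (12, "u2", "old"), (11, "u1", "a"), (15, "u2", "a"),
       (18, "u3", "b"), (13, "u4", "c"), (22, "u5", "b"), (25, "u1", "b"),
       (23, "u2", "c"), (40, "u2", "b")]
subset, interest, future = build_subset(toy, t_x=10, t_s=10, t_f=10, tau_max=5)
print(sorted(interest), len(future))
```

User surprisal would be computed from `subset` only, while `future` supplies the later degree growth of the items of interest.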


[Diagram: a time axis starting at 0 with two consecutive windows labeled "subset" and "future".]

A diagram of the subset creation for the evaluation of future degree evolution.


Appendix C: Analysis of the Yelp data 

We now address the question of whether the observed discovery patterns can be explained by the social influence
of users. To this end, we use a dataset from the Yelp academic challenge, round 4 (see http://www.yelp.com/
academic-dataset for more information). The advantage of this dataset is that it features both the bipartite user-
item network and the social user-user network (the Delicious web site also allowed the users to form friendship
links but unfortunately we do not have the social network information and thus have to use a new dataset). The
input data contain 252,898 users, 42,153 items (which in this case represent businesses), 955,999 friendship links,
and 1,125,458 reviews on the integer scale from 1 to 5; the time stamps run from 0 to 3558 (measured in days). We
only keep the users who have at least one friend and have authored at least one review. As for the other datasets, we
use the rating threshold of four and focus on a subset of the data (in this case the evaluations from day 1000
until day 3499; we thus ignore the rather long initial period of 1000 days in order to avoid well-known items that
existed before day 0, for which awarding discoveries would be unjust). We finally have a dataset with
80,840 users, 33,661 items, 348,060 user-item links and 674,231 directed user-user links.

As in the other reported datasets, the Yelp data also feature discoverers: the largest user surprisal value is 21, the
average highest surprisal in bootstrap realizations is 9.7, and the number of identified discoverers is 30. A comparison
of the set of 100 highest surprisal users with the set of 100 most social users (as measured by the number of friends)
reveals that the two sets share only one user and even this user does not pass the bootstrap surprisal threshold
(the user’s surprisal value is thus not statistically significant and could occur by chance). We can conclude that
in the Yelp data, users with many social contacts are in no way more successful in achieving discoveries than users
with few social contacts.

SUPPLEMENTARY TABLES


Dataset     Users     Items       Links       Time span      k_i    K_i      k_α    K_α      k_D
Amazon      406,275   76,205      713,581     3,000 days     1.8    1,296    9.4    790      127
Delicious   107,810   2,435,912   9,322,949   20,000 hours   86.5   6,582    3.8    7,014    40
Epinions    17,542    32,482      753,392     499 days       42.9   5,809    23.2   508      154
Keyword     1,523     915,271     2,824,853   40,360 hours   1855   23,775   3.1    227      32
Movielens   44,548    7,974       2,132,128   82,624 hours   47.9   2,419    267    18,858   3,852
Netflix     115,131   7,351       2,775,772   1,000 days     24.1   959      378    26,700   8,256


Table S1: Basic statistical properties of the studied datasets. The time span column specifies both the duration
and the time resolution of the datasets. k_i and k_α are the mean user and item degree, respectively. K_i and K_α are the
largest user and item degree, respectively. k_D is the smallest degree at which an item is counted among the items
that are to be discovered when f_D = 1%.


        Amazon                                    Delicious
Rank    k_i    d_i    r_i     P_i^0     S_i       k_i     d_i    r_i     P_i^0     S_i
1       188    59     51.6    10^-82    187.6     3556    195    3.9     10^-56    127.9
2       244    50     33.7    10^-69    135.3     768     88     8.2     10^-60    115.1
3       217    35     26.5    10^-38    86.4      3835    192    3.6     10^-49    112.7
4       237    26     18.0    10^-24    54.4      2124    136    4.6     10^-47    106.8
5       190    24     20.8    10^-24    53.8      2625    141    3.9     10^-40    91.0
6       364    26     11.7    10^-19    43.5      894     82     6.6     10^-40    90.7
7       185    18     16.0    10^-16    36.1      639     69     7.7     10^-38    87.0
8       73     11     24.8    10^-?     27.6      1019    85     6.0     10^-38    86.7
9       41     9      36.1    10^-12    26.4      4585    185    2.9     10^-35    80.2
10      60     10     27.4    10^-12    26.2      395     47     8.5     10^-28    64.2
11      12     6      82.2    10^-11    23.8      1060    73     4.9     10^-2?    62.8
12      42     8      31.3    10^-10    22.4      1177    73     4.4     10^-25    56.6
13      432    18     6.8     10^-10    21.7      1116    71     4.6     10^-25    56.5
14      47     8      28.0    10^-10    21.5      808     59     5.2     10^-24    54.1
15      99     10     16.6    10^-?     21.1      2355    104    3.2     10^-23    52.7
16      31     7      37.1    10^-?     21.1      622     50     5.8     10^-22    50.3
17      51     8      25.8    10^-?     20.8      573     48     6.0     10^-22    50.2
18      35     7      32.9    10^-?     20.1      292     35     8.6     10^-21    48.6
19      23     6      42.9    10^-?     19.2      75      21     20.1    10^-21    48.3
20      71     8      18.5    10^-?     18.1      1580    78     3.5     10^-20    46.4

Table S2: Twenty users with the highest surprisal in the Amazon and Delicious data. k_i is the degree of
user i, d_i is the number of discoveries by user i, r_i := d_i/(p_D k_i) is the ratio between the actual number of discoveries
d_i and the number of discoveries expected under the null hypothesis p_D k_i, P_i^0 is the probability that user i makes at
least d_i discoveries under the null hypothesis, and finally S_i := −ln P_i^0 is the corresponding surprisal (cf. Equations
(1) and (2) in the main text).
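
As a rough cross-check of the quantities in Table S2, the surprisal can be computed as sketched below. The sketch assumes that under the null hypothesis each of a user's k_i links is independently a discovery with probability p_D, so that the number of discoveries is binomially distributed. This is consistent with the caption but the exact form of Equations (1) and (2) is not reproduced here. The function names are illustrative, and the value of p_D in the usage example is back-computed from the rounded r_i of the top Amazon user, so it is only approximate.

```python
import math


def log_binomial_pmf(n, k, p):
    """Natural log of the binomial probability of k successes in n trials (0 < p < 1)."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def surprisal(k_i, d_i, p_d):
    """S_i = -ln P(at least d_i discoveries among k_i links), binomial null model."""
    if d_i == 0:
        return 0.0  # the tail probability is exactly one
    logs = [log_binomial_pmf(k_i, n, p_d) for n in range(d_i, k_i + 1)]
    # Log-sum-exp keeps the sum stable for extremely small probabilities.
    peak = max(logs)
    log_tail = peak + math.log(sum(math.exp(v - peak) for v in logs))
    return -log_tail


def discovery_ratio(k_i, d_i, p_d):
    """r_i = d_i / (p_D k_i): actual over expected number of discoveries."""
    return d_i / (p_d * k_i)


# Top Amazon user from Table S2: k_i = 188, d_i = 59, r_i = 51.6.
p_d_amazon = 59 / (51.6 * 188)  # approximate, inferred from the rounded r_i
print(round(surprisal(188, 59, p_d_amazon), 1))
```

Working with logarithms matters here because tail probabilities of the order of 10^-82 are common in the table and even smaller values would underflow if the probabilities were summed directly.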


Dataset     T_W            T_S           T_F           τ_max
Amazon      3,000 days     2,000 days    100 days      20 days
Delicious   20,000 hours   7,200 hours   2,400 hours   480 hours

Table S3: Parameters of the future degree evaluation procedure. T_W is the time span of the whole dataset,
T_S is the time span of subsets, T_F is the future time window in which the degree of items of interest is observed, and
τ_max is the maximal age of an item of interest in a given subset.

Figure S1: Collection patterns of users in the Delicious data. Just as Figures 1A and 1B in the main text
show the collection patterns of different users in the Amazon data, we show here the same for the Delicious data. Each
bar again corresponds to an object collected at a given time point, with the red and blue parts indicating the item’s
degree at the moment when it was collected by a given user and the final degree, respectively. We show here data for
three users with the highest surprisal and a randomly chosen active user without discoveries.




Figure S2: Bootstrap results for other data sets. In analogy with Figure 1 in the main text, we show here Zipf
plots of real and bootstrap surprisal values for four additional data sets (Epinions, Taobao keyword data, Movielens,
and Netflix subsets). We do not indicate here the standard deviations of the bootstrap surprisal values because they
are small: around 1 for the top-ranked user and decreasing quickly with user rank.

Figure S3: Temporal distribution of discoveries. We investigate here whether discoveries and surprisal are
strongly biased towards, for example, the early period of the analyzed data when the number of items was small and
it was thus perhaps easier to make discoveries. To this end, we show the Zipf plots of the relative entrance time of
target items (the relative entrance times of 0 and 1 correspond to the beginning and end of the data set). Nearly
straight lines in panels A and B indicate that target items are distributed rather uniformly through the data time
span. Panels C and D show the mean discovery time (again in relative units) of individual users plotted against the
number of discoveries achieved by them. The dotted line represents the mean discovery time averaged over all users. The
dashed line represents a linear fit of the data for individual users in the log-linear plane. Panels E and F show the
mean discovery time of individual users against their surprisal. See Figure S4 for more detailed information about
the discovery patterns of the top 20 users in both number of discoveries and surprisal.






















Figure S4: Temporal distribution of discoveries for top 20 users. To uncover possible time bias in the discovery
patterns of users, we show here box plots of the discovery times of individual users who are ranked among the top 20
users either in number of discoveries (A, B) or in surprisal (C, D). The boxes represent the first and third quartile of
the discovery times for each individual user; the bands represent median values; the whiskers represent the minimum
and maximum values. One can see here that discoveries are spread over a substantial time period for the majority of top
users, with only a few users achieving a substantial fraction of their discoveries at the very beginning of the data (the
only exceptions are user #19 in surprisal and user #12 in number of discoveries, both in the Delicious data). We
can conclude that the high numbers of discoveries and surprisal achieved by some users are not due to their privileged
position.

Figure S5: Stability of user surprisal values with respect to parameters. In the main text, we choose f_D = 1%
and N_D = 5 for both Amazon and Delicious data. To investigate the effect of these two parameters on user surprisal
values, we first compute the vector of user surprisal values for f_D = 1% and denote it S*. We then calculate the vector of
surprisal values S for any different f_D and compute the Pearson correlation coefficient r(S*, S), which is then shown
in panels A and B. The procedure is the same in panels C and D, except that the Pearson correlation
coefficient is computed only over users whose surprisal in S* is greater than 10 (we focus in this way on users who matter most
from the perspective of their discovery ability). Results for various values of N_D (recall that the N_D first links attached
to a target item are marked as discoveries) are shown here. When N_D = 1, surprisal values are more sensitive to
changes of f_D because the information used to compute surprisal is then rather limited. Results (panels C and D in
particular) show that surprisal values are rather robust: increasing or decreasing f_D by a factor of two still yields
correlation values above 0.9 for both Amazon and Delicious data.
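
The stability check described in this caption reduces to a Pearson correlation between two surprisal vectors, optionally restricted to users whose reference surprisal exceeds a threshold. A minimal sketch follows. It assumes surprisal values are stored in dictionaries keyed by user, and the names `stability` and `min_surprisal` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def stability(s_ref, s_alt, min_surprisal=None):
    """Correlation r(S*, S) over users, optionally only those with S* > min_surprisal."""
    users = [u for u in s_ref
             if u in s_alt and (min_surprisal is None or s_ref[u] > min_surprisal)]
    return pearson([s_ref[u] for u in users], [s_alt[u] for u in users])


# Toy usage: reference values S* and values S obtained with another parameter.
s_star = {"a": 20.0, "b": 12.0, "c": 11.0, "d": 1.0, "e": 0.0}
s_other = {"a": 18.0, "b": 13.0, "c": 9.0, "d": 2.0, "e": 0.0}
print(stability(s_star, s_other), stability(s_star, s_other, min_surprisal=10))
```

The same function applies to the comparison in Figure S6, where the second vector comes from truncated data rather than from a different parameter value.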

Figure S6: Stability of user surprisal values with respect to the data. We compute here the vector of user
surprisal values S* using the full data and measure its correlation with the vector of user surprisal values S obtained
on data that end after a certain fraction of links (panels A and B) and data that end at a certain time (panels C
and D). As in Fig. S5, the correlation value is computed for all users as well as for users whose surprisal value in S* is
greater than 10. Results in panels A and B show that even when one half of the links is omitted (note that we omit here
the most recent links, not the oldest), the correlation between S* and S is still around 0.9. The correlation decreases faster
in panels C and D than in panels A and B because the speed at which new links are added increases with time in the
studied systems. Setting the end time to T_max = 1,500 in the Amazon data thus corresponds to using substantially
less than T_max/T_W = 1/2 of all links.




























